Decreasing End-to-End Job Execution Times by Increasing Resource Utilization using Predictive Scheduling in the Grid

Author

  • Ioan Raicu, Department of Computer Science, University of Chicago (Research Proposal, CMSC33340 Grid Computing, March 16, 2005)
Abstract

The Grid has the potential to grow significantly over the next decade, and the mechanisms that make the Grid possible therefore need to become more efficient in order for the Grid to scale. One of these mechanisms revolves around resource management: ultimately, there will be so many resources in the Grid that, if they are not managed properly, only a very small fraction of them will be utilized. While good resource utilization is very important, it is also a hard problem due to the widely distributed, dynamic environments normally found in the Grid. It is important to develop an experimental methodology for automatically characterizing Grid software in a manner that allows accurate evaluation of the software's behavior and performance before deployment, in order to make better-informed resource management decisions. Many Grid services and software are designed and characterized today largely based on the designer's intuition and on ad hoc experimentation; having the capability to automatically map complex, multi-dimensional requirements and performance data among resource providers and consumers is a necessary step toward ensuring consistently good resource utilization in the Grid. This automatic matching between a software characterization and a set of raw or logical resources is a much-needed functionality that is currently lacking in today's Grid resource management infrastructure. Ultimately, my proposed work, which addresses performance modeling with the goal of improving resource management, could ensure that the efficiency of resource utilization in the Grid remains high as the size of the Grid grows.

1.0 Introduction

Through my previous work, I have shown that DiPerF [1, 2, 3] can be used to model the performance characteristics of a service in a client/server scenario. Using the basic concept of DiPerF, I believe I can also create performance models that can be used to predict the future performance of distributed applications. Modeling distributed applications (i.e. parallel computational programs) might be more challenging, since the datasets the applications work against often influence the performance of the application, and therefore general predictive models might not be sufficient. Using DiPerF and a small dedicated cluster of machines, we can build dynamic performance models that automatically map raw hardware resources to the performance of a particular distributed application and its representative workload; in essence, these dynamic performance models can be thought of as job profiles, and will be implemented in the DiProfile component. The intuition behind DiProfile is that, based on some small sample workloads (of varying sizes) and a small set of resources (of varying size), we can make predictions regarding the execution time and resource utilization of the entire job running over the complete dataset. The DiProfile stage will be relatively expensive in both time and computational resources; however, its overhead will be warranted as long as the typical job submitted is significantly larger than the amount of time DiProfile needs to build its dynamic performance models. There is a gap between software requirements (high level) and hardware resources (low level); automatic mapping could produce better scheduling decisions and give users feedback on the expected running time of their software. Using DiProfile, we can make predictions about the performance of jobs based on the amount of raw resources dedicated to them.
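To make the DiProfile idea concrete, the following is a minimal sketch of how a handful of sample runs over small workload and resource sizes could be extrapolated to the full job. The power-law model, sample numbers, and function names are hypothetical illustrations, not something the proposal prescribes:

```python
import numpy as np

# Hypothetical profiling runs: (workload_size, num_nodes, runtime_sec).
# In a DiProfile-style setup, these would come from running the job on
# small slices of the dataset over a small dedicated cluster.
samples = [
    (100, 1, 42.0), (100, 2, 23.0), (200, 1, 85.0),
    (200, 2, 46.0), (400, 2, 95.0), (400, 4, 51.0),
]

# Fit log T = log c + a*log n + b*log p by linear least squares,
# i.e. a power-law model T = c * n^a * p^b (n = workload, p = nodes).
X = np.array([[1.0, np.log(n), np.log(p)] for n, p, _ in samples])
y = np.array([np.log(t) for _, _, t in samples])
(log_c, a, b), *_ = np.linalg.lstsq(X, y, rcond=None)

def predict_runtime(workload_size: float, num_nodes: float) -> float:
    """Extrapolate the fitted model to the full job."""
    return float(np.exp(log_c + a * np.log(workload_size) + b * np.log(num_nodes)))

# Predict the full job (100x the largest sampled workload) on 16 nodes.
print(f"predicted runtime: {predict_runtime(40_000, 16):.0f} s")
```

Whether such a simple model suffices is exactly the open question raised above: datasets that influence performance in irregular ways would break this kind of extrapolation.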
The accuracy of the predictions will rely heavily on the idea that reliable software performance characterization is possible with only a fraction of the data input space. Using DiPred and user feedback (or even user-specified high-level performance goals), the scheduler (DiSched) can make better decisions about how to satisfy the requested duration of the job, where the job should be placed, and so on. Since jobs are profiled based on what raw resources they will likely consume and for how long, multiple different jobs could be submitted simultaneously to the same nodes without significant loss of individual job performance; this would certainly increase resource utilization, and as long as the predicted resource usage does not exceed the available resources, the time it takes to complete individual jobs should not be significantly affected. This increase in resource utilization is possible as long as the assumption holds that several different classes of software can be executed concurrently without significant loss of performance. It is expected that resource managers could use a combination of resource selection algorithms besides the proposed DiSched component. To ensure that the scheduler's view of the available resources is maintained and kept up to date, resource monitoring (e.g. Ganglia, MDS) will also be necessary. Some current resource managers use resource monitoring to make scheduling decisions, but normally only one job is submitted to each individual resource. Combining resource monitoring with predictive scheduling has the potential not only to improve scheduling decisions and yield lower end-to-end job execution times, but also to increase resource utilization significantly.
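As an illustration of the kind of decision DiSched could make with such profiles, here is a minimal first-fit co-location sketch. The node capacities, resource names, and first-fit policy are hypothetical; the proposal does not fix a packing algorithm:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    # Remaining capacity per resource, in hypothetical units.
    free: dict = field(default_factory=lambda: {"cpu": 1.0, "mem_gb": 8.0})
    jobs: list = field(default_factory=list)

def fits(node: Node, demand: dict) -> bool:
    return all(node.free.get(r, 0.0) >= v for r, v in demand.items())

def place(nodes: list, job_id: str, demand: dict) -> Optional[Node]:
    """First-fit co-location: admit a job onto a node only if its predicted
    resource usage fits within the node's remaining capacity, so co-located
    jobs should not slow each other down significantly."""
    for node in nodes:
        if fits(node, demand):
            for r, v in demand.items():
                node.free[r] -= v
            node.jobs.append(job_id)
            return node
    return None  # no node can host the job without oversubscription

nodes = [Node() for _ in range(2)]
# The demand vectors would come from DiProfile-style predictions.
place(nodes, "job-a", {"cpu": 0.6, "mem_gb": 4.0})
place(nodes, "job-b", {"cpu": 0.3, "mem_gb": 2.0})  # co-located with job-a
print([n.jobs for n in nodes])
```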
2.0 Related Work

The goal of mapping software requirements to available resources has been studied extensively; in practice, however, it remains a relatively manual and application-specific process. One approach is to run benchmarks on the resources, but this still requires user-specified relationships between the benchmarks and the software requirements to be established. Performing these benchmarks on the resources acts as a "stepping stone" toward gaining insight into the performance of a particular set of resources and into what kinds of problems could benefit the most from them. Regarding application performance models, there seem to be two general methods: workflow analysis and, to some degree, the use of compilers. The drawback of workflow analysis is that producing accurate workflows is hard and time consuming; compiler technology, in turn, requires user intervention to model complex applications. Furthermore, such performance models might not reflect actual performance on different architectures. What I propose is to seek an automatic mapping between software requirements and available resources using a "black box" approach, which would ideally be generic and application independent. In principle, performance models could be built and classified into several software classes based on the type and amount of resource usage. The models would take into consideration interactions between various components within a system, and across distributed, dissimilar systems. Best of all, applications could have performance models built without any modifications or expert knowledge about the particular software.

Some of the lessons learned from the related work are that benchmarking of Grid resources can be used to enhance resource selection, especially in the heterogeneous systems often found in today's Grids. Co-scheduling could also be used, based on the established software classes, to increase resource utilization. Furthermore, historical information could be used to recall the performance models generated for commonly used software, saving the cost of generating a new model. Much of the surveyed work concentrates on lower levels of scheduling (i.e. local scheduling); it is important to also address scheduling decisions on a larger, global scale, perhaps giving hints to the low-level schedulers in order to improve local scheduling.

2.1 Benchmarking of Resources

GridBench [30, 31] is a set of tools that aims to facilitate the characterization of Grid nodes or collections of Grid resources. To perform benchmarking measurements in an organized and flexible way, the GridBench framework provides a means for running benchmarks on Grid environments as well as collecting, archiving, and publishing the results. This data is available for retrieval not only by end users but also by automated decision-makers such as schedulers. A scheduler could use micro-benchmark results to "rank" the resources based on performance (CPU, memory, or MPI), as in the sketch below. Additionally, a scheduler could evaluate a resource's "health" by invoking one of the micro-benchmarks; since execution times are typically less than 10 seconds, this imposes little additional delay and could save a scheduler from time-consuming failed submissions. Once the relationship between the micro-benchmarks and the application kernel has been established (in a manual fashion that involves human intervention and detailed knowledge of the application or its empirical performance), it can be applied to resource selection.
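A minimal sketch of such micro-benchmark-based ranking follows; the benchmark scores are hypothetical, and the user-supplied weights stand in for the manually established relationship between micro-benchmarks and the application kernel:

```python
# Hypothetical micro-benchmark results per resource (higher is better).
benchmarks = {
    "node-a": {"cpu": 1200, "mem": 950,  "mpi": 80},
    "node-b": {"cpu": 800,  "mem": 1100, "mpi": 95},
    "node-c": {"cpu": 1500, "mem": 700,  "mpi": 60},
}

def rank_resources(results: dict, weights: dict) -> list:
    """Rank resources by a weighted sum of micro-benchmark scores; the
    weights encode how strongly each benchmark predicts the application."""
    score = lambda name: sum(weights.get(k, 0.0) * v
                             for k, v in results[name].items())
    return sorted(results, key=score, reverse=True)

# A CPU-bound application kernel would weight the CPU benchmark heavily.
print(rank_resources(benchmarks, {"cpu": 0.7, "mem": 0.2, "mpi": 0.1}))
```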
Elmroth et al. [38] present algorithms, methods, and software for a Grid resource manager responsible for resource brokering and scheduling in early production Grids. The broker selects computing resources based on actual job requirements and on a number of criteria identifying the available resources, with the aim of minimizing the total time to delivery for the individual application. The total time to delivery includes the time for program execution, batch-queue waiting, input/output data transfer, and executable staging. The main features of the resource manager include advance reservations, resource selection based on computer benchmark results and network performance predictions, and a basic adaptation facility. The performance differences between Grid resources, and the fact that their relative performance characteristics may vary for different types of applications, make resource selection difficult; the authors handle this with a benchmark-based procedure for resource selection. Based on the user's identification of relevant benchmarks and an estimated execution time on some specified resource, the broker estimates the execution time for all resources of interest. The authors assume linear scaling of the application in relation to the benchmark, i.e., a resource with a benchmark result a factor k better is assumed to execute the application a factor k faster. For each of these benchmarks, the user needs to specify a benchmark result and an expected execution time on a system corresponding to that benchmark result. The time estimates for the stage-in and stage-out procedures are based on the actual (known) sizes of the input files and the executable, user-provided estimates of the sizes of the output files, and network bandwidth predictions. The bandwidth predictions are performed using the Network Weather Service (NWS) [42], which combines periodic bandwidth measurements with statistical methods to make short-term predictions about the available bandwidth.
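A sketch of this benchmark-scaled estimate of the time to delivery is given below; the function, numbers, and the single-benchmark simplification are hypothetical, and the bandwidth value stands in for an NWS-style forecast:

```python
def estimate_total_time(ref_runtime_s: float, ref_score: float, score: float,
                        bytes_to_stage: float, predicted_bw_Bps: float) -> float:
    """Total time to delivery = scaled execution time + data staging time.
    Linear scaling: a resource with a benchmark score a factor k better is
    assumed to execute the application a factor k faster."""
    exec_time = ref_runtime_s * (ref_score / score)
    transfer_time = bytes_to_stage / predicted_bw_Bps
    return exec_time + transfer_time

# Reference: 3600 s on a machine scoring 800; the candidate scores 1200,
# with 2 GB to stage over a predicted 50 Mbit/s (6.25e6 B/s) link.
t = estimate_total_time(3600, 800, 1200, 2e9, 6.25e6)
print(f"estimated total time to delivery: {t:.0f} s")  # 2400 + 320 = 2720 s
```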
2.2 Workflow Analysis

Spooner et al. [32] developed a multi-tiered scheduling architecture (TITAN) that employs a performance prediction system (PACE) and task distribution brokers to meet user-defined deadlines and improve resource usage efficiency. This work focused on the lowest tier, which is responsible for local scheduling. By coupling application performance data with scheduling heuristics, the architecture is able to balance minimizing run-to-completion time against minimizing processor idle time, whilst adhering to service deadlines on a per-task basis. The PACE system provides a method to predict execution time dynamically, given an application model and suitable hardware descriptions. The hardware (resource) descriptions are generated when the resources are configured for use by TITAN, and the application models are generated prior to submission. PACE models are modular, consisting of application, subtask, parallel, and resource objects. Application tools are provided that take C source code and generate sub-tasks capturing the serial components of the code as control flow graphs; it may be necessary to add loop and conditional probabilities to the sub-tasks where data cannot be identified via static analysis. The parallel object describes the parallel operation of the application; this can be reasonably straightforward for simple codes, and a library of templates exists for standard constructs, but applications that exhibit more complex parallel operations may require customization. The sub-tasks and parallel objects are compiled from a performance specification language (PSL) and are linked together with an application object that represents the entry point of the model. Resource tools are available to characterize the resource hardware through micro-benchmarking and modeling techniques for communication and memory hierarchies. The resultant resource objects are then used as inputs to an evaluation engine, which takes the resource objects and parameters and produces predictive traces and an estimated execution time. Scheduling issues are addressed at this level using task scheduling algorithms driven by PACE performance predictions. In summary, the techniques presented have been developed into a working system for scheduling parallel tasks over a heterogeneous network of resources; a genetic algorithm forms the centre point of the localized workload managers and is responsible for selecting, creating, and evaluating new schedules.

Prophesy [40] is an infrastructure for performance analysis and modeling of parallel and distributed applications. Prophesy includes three components: automatic instrumentation of applications, databases for archiving the information, and automatic development of performance models using different techniques. The default mode instruments the entire code, via PAIDE, at the level of loops and procedures. PAIDE includes a parser that identifies where to insert instrumentation code, and it generates two files: (1) the call graph of the application and (2) the locations in the code where instrumentation was inserted. The information in these two files allows the performance data to be related directly to the application code for code tuning. A user can specify that the code be instrumented at different levels of granularity, or manually insert directives for the instrumenting tool to instrument specific segments of code. The resultant performance data is automatically placed in the performance database. This data is used by the data analysis component to produce an analytical performance model at the level of granularity specified by the user, or to answer queries about the best implementation of a given function. The models are developed from performance data in the performance database, model templates from the template database, and system characteristics from the systems database, and they can be used to predict the performance of the application under different system configurations. Currently, Prophesy includes three methods for developing analytical models for prediction: (1) curve fitting, (2) parameterized models of the application code, and (3) coupling of kernel models. The advantage of curve fitting is the ease with which the analytical model is generated; the disadvantage is that system terms are not exposed separately from application terms. Hence, models resulting from curve fitting can be used to explore application scalability, but not different system configurations. Parameterization combines manual analysis of the code with system performance measurements; the manual analysis entails hand-counting the number of different operations in the code. With the system and application terms represented explicitly, one can use the resultant models to explore what happens under different system configurations as well as different application sizes; the disadvantage of this method is the time required for the manual analysis. Kernel coupling refers to the effect that kernel i has on kernel j relative to running each kernel in isolation; the two kernels can correspond to adjacent kernels in the control flow of the application, or to a chain of three or more kernels. The coupling values provide insight into where further algorithm and code implementation work is needed to improve performance, in particular the reuse of data between kernels.
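As a concrete illustration of the simplest of the three methods, curve fitting, the sketch below fits a power-law runtime model to measured data points. The data and model form are hypothetical, and Prophesy's templates and databases are not modeled:

```python
import numpy as np

# Hypothetical measured runtimes for growing problem sizes.
n = np.array([128, 256, 512, 1024, 2048], dtype=float)
t = np.array([0.9, 3.4, 13.8, 55.0, 221.0])

# Curve-fit T(n) = c * n^a in log-log space. Such a model can explore
# application scalability, but because system and application terms are
# not separated, it cannot predict other system configurations.
a, log_c = np.polyfit(np.log(n), np.log(t), deg=1)

predict = lambda size: np.exp(log_c) * size ** a
print(f"T(n) ~ {np.exp(log_c):.2e} * n^{a:.2f}; T(4096) ~ {predict(4096):.0f} s")
```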
Pegasus [43, 44] (Planning for Execution in Grids) was developed at ISI as part of the GriPhyN and SCEC/IT projects. Pegasus is a configurable system that can map and execute complex workflows on the Grid; currently, it relies on full-ahead planning to map the workflows. The main difference between Pegasus and other work on workflow management is that while most other systems focus on resource brokerage and scheduling strategies, Pegasus uses the concept of virtual data and provenance to generate and reduce the workflow based on data products that have already been computed. It prunes the workflow under the assumption that it is always more costly to compute a data product than to fetch it from an existing location. Pegasus also automates replica selection, so that the user does not have to specify the location of the input data files, and it can map and schedule only portions of the workflow at a time, using just-in-time planning techniques. Although Pegasus provides a feasible solution, it is not necessarily a low-cost one in terms of performance. Jang et al. [33] present a resource planner system consisting of the Pegasus [43, 44] workflow management and mapping system combined with the Prophesy [40] performance modeling infrastructure. Pegasus is used to map an abstract workflow description onto the available Grid resources. The abstract workflow indicates the logical transformations that need to be performed, their order, and the logical data files they consume and produce; it does not include information about where to execute the transformations or where the data is located. Pegasus uses various Grid services to find the available resources, the needed data, and the executables that correspond to the transformations, and it reduces the abstract workflow if the intermediate products are found to already exist somewhere in the Grid environment. One of the ways Pegasus maps a workflow onto the available resources is through random allocation; this work interfaces Pegasus with Prophesy so that Pegasus can use the Prophesy prediction mechanisms to make more informed resource choices.

The goal of the Grid Application Development Software (GrADS) project [28] is to realize a Grid system by providing tools, such as problem-solving environments, Grid compilers, schedulers, and performance monitors, to manage all stages of application development and execution. Using GrADS, the user concentrates only on high-level application design without attending to the peculiarities of the underlying Grid computing platform. The GrADS system is composed of three main components: the Program Preparation System (PPS), the Configurable Object Program (COP), and the Program Execution System (PES). The PPS handles application development, composition, and compilation. To develop their Grid application, users interact with a high-level interface providing a problem-solving environment, which permits the integration of the application source code, software components, and library modules. The resulting application is then passed to a specialized GrADS compiler that generates an intermediate representation of the code and a configurable object program (COP); the COP encapsulates all results of the PPS phase (e.g. application performance models and the intermediate application representation) for later use. The PES provides on-line resource discovery, scheduling, binding, application performance monitoring, and rescheduling. To execute an application, the user submits problem parameters, such as the problem size, to the GrADS system. The PES receives the COP as input, and at this stage the scheduler computes an application-appropriate schedule. The binder is then invoked to perform a final, resource-specific compilation of the intermediate representation. Next, the executable is launched on the selected Grid resources, and a real-time performance monitor tracks program performance and detects violations of performance guarantees, which are formalized in a performance contract. In the case of a performance contract violation, the rescheduler is invoked to evaluate alternative schedules. The scheduler is a key component of the GrADS system: scheduling decisions are made by exploiting application characteristics and requirements in order to obtain the best application execution time.
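The performance-contract mechanism can be illustrated with a small sketch; the class name, progress metric, and tolerance below are hypothetical stand-ins, not GrADS APIs:

```python
# A minimal sketch of a GrADS-style performance contract: a monitor compares
# observed progress against the model's prediction and invokes the rescheduler
# when the shortfall exceeds a tolerance. Names and thresholds are hypothetical.

class PerformanceContract:
    def __init__(self, predicted_rate: float, tolerance: float = 0.25):
        self.predicted_rate = predicted_rate  # e.g. iterations/s from the performance model
        self.tolerance = tolerance            # allowed fractional shortfall

    def violated(self, work_done: float, elapsed_s: float) -> bool:
        observed_rate = work_done / max(elapsed_s, 1e-9)
        return observed_rate < (1.0 - self.tolerance) * self.predicted_rate

contract = PerformanceContract(predicted_rate=100.0)
# The real-time monitor would sample progress periodically during execution.
if contract.violated(work_done=600, elapsed_s=10.0):  # 60 it/s < 75 it/s threshold
    print("performance contract violated: invoking the rescheduler")
```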
Dail et al. [37] proposed an application scheduling approach designed to improve the performance of parallel applications in computational Grid environments. The approach is general and can be applied to a range of applications in a variety of execution environments. This flexibility is achieved through a decoupling of the scheduler core (the search procedure) from the application-specific components (e.g. application performance models) and platform-specific components (e.g. collection of resource information) used by the search procedure. While the scheduler can be used in a stand-alone fashion, it has been designed specifically for a larger program development environment such as the GrADS project; the decoupled design allows integration with other GrADS components to provide transparent and generic scheduling. To provide application-appropriate scheduling, the system depends on the availability of two application-specific components: a performance model and a mapper. As the GrADS system matures, the authors hope to obtain such components automatically from application development tools such as the GrADS compiler. The performance model is an analytic metric for the performance expected of the application on a given set of resources; the mapper provides directives for mapping logical application data or tasks to physical resources. To validate their approach in the absence of such facilities, the authors hand-built performance models and mappers for two applications: Game of Life and Jacobi. The measured scheduling times of the various configurations were roughly 4.5 seconds with a local MDS cache, 62.4 seconds with a remote NWS nameserver, and 1088.4 seconds with both a remote NWS nameserver and a remote MDS server. The authors assume that a Grid user will probably be willing to wait 60 seconds for scheduling, but not 1000 seconds; these results indicate that, given the technologies available at the time of their experiments (2002), their scheduling approach was feasible only when used with a local MDS cache.
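The decoupling can be illustrated as follows: a generic search procedure that treats the performance model and the mapper as pluggable callables. The resource attributes, mapper, and cost model below are hypothetical stand-ins for the hand-built Game of Life/Jacobi components:

```python
from itertools import combinations

def schedule(resources, perf_model, mapper, group_size=2):
    """Generic search procedure: enumerate candidate resource groups and
    return the group/mapping with the best (lowest) predicted time."""
    best = None
    for group in combinations(resources, group_size):
        mapping = mapper(group)                 # app-specific data/task placement
        predicted = perf_model(group, mapping)  # app-specific analytic model
        if best is None or predicted < best[0]:
            best = (predicted, group, mapping)
    return best

# Hypothetical plug-ins for a Jacobi-like stencil code.
resources = [{"name": "a", "mflops": 500}, {"name": "b", "mflops": 300},
             {"name": "c", "mflops": 800}]
mapper = lambda group: {r["name"]: 1.0 / len(group) for r in group}
perf_model = lambda group, m: 1e6 / sum(r["mflops"] for r in group)
print(schedule(resources, perf_model, mapper))
```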
AppLeS (Application Level Scheduling) [45] is a project led by F. Berman at the University of California, San Diego. It is a methodology for adaptive application scheduling on heterogeneous computing platforms. The AppLeS approach exploits static and dynamic resource information, performance predictions, application- and user-specific information, and scheduling techniques that adapt on the fly to application execution. The phases of the AppLeS scheduling methodology are shown in Figure 2: the System Selection phase of the general scheduler architecture of Figure 1 is split in AppLeS into three sub-phases, (2) resource selection, (3) schedule generation, and (4) schedule selection. During sub-phase (2), the resources eligible to run the application are selected according to application-specific resource selection models. To this end, AppLeS uses information provided by the Network Weather Service (NWS) performance monitor, a distributed system that periodically monitors and dynamically forecasts the performance that various network and computational resources can deliver over a given time interval; an ordered list of viable resources is produced. In sub-phase (3), a performance model is applied to determine a set of candidate schedules for the application on the selected resources (for any given set of resources, many schedules may be possible), and in sub-phase (4) the schedule that best matches the chosen performance criteria is selected; a sketch of this generate-and-select loop is given below. The AppLeS approach requires integrating into the application a scheduling agent, which must be customized according to the application's features. To ease this customization, templates applicable to classes of applications with common characteristics were introduced; templates for parameter sweep applications (APST), master/worker applications (AMWAT), and for scheduling moldable jobs on space-shared parallel supercomputers (SA) are currently available.
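A minimal sketch of sub-phases (3) and (4), generating candidate schedules and selecting the one that best matches the performance criterion; the hosts, forecasts, and cost model are hypothetical:

```python
from itertools import permutations

selected = ["hostA", "hostB", "hostC"]        # sub-phase (2) output (ordered list)
forecast = {"hostA": 900.0, "hostB": 600.0, "hostC": 300.0}  # NWS-style MFLOPS forecasts
work = [40.0, 25.0, 10.0]                     # work units per task

def predicted_makespan(schedule):
    # Performance model: each host runs one task; makespan = slowest task.
    return max(w / forecast[h] for w, h in zip(work, schedule))

# Sub-phase (3): generate candidate schedules (task-to-host assignments).
candidates = list(permutations(selected, len(work)))
# Sub-phase (4): select the schedule best matching the performance criterion.
best = min(candidates, key=predicted_makespan)
print(best, f"makespan ~ {predicted_makespan(best):.3f} s")
```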
2.3 Co-Scheduling

Frachtenberg et al. [34, 35] performed a detailed performance evaluation of five factors affecting scheduling systems running dynamic workloads: multiprogramming level, time quantum, gang scheduling, backfilling, and flexible co-scheduling [36]. The results demonstrated the importance of both components of the gang-scheduling-plus-backfilling combination: gang scheduling reduced response time and slowdown, and backfilling allowed doing so with a limited multiprogramming level. This was further improved by using flexible co-scheduling rather than strict gang scheduling, as this relaxed the constraints and allowed denser packing. Multiprogramming on parallel machines may be done using two orthogonal mechanisms: time slicing and space slicing. With time slicing, each processor runs processes belonging to many different jobs concurrently and switches between them; with space slicing, the processors are partitioned into groups that serve different jobs. Gang scheduling is a technique that combines the two approaches: all processors are time-sliced in a coordinated manner, and in each time slot they are partitioned among multiple jobs. Gang scheduling may be limited by memory constraints. Backfilling is an optimization that improves the performance of pure space slicing by using small jobs from the end of the queue to fill holes in the schedule; to do so, it requires users to provide estimates of job run times. Flexible co-scheduling employs dynamic process classification and schedules processes using this class information. Processes are categorized into one of four classes: CS (co-scheduling), F (frustrated), DC (don't-care), and RE (rate-equivalent). CS processes communicate often and must be co-scheduled (gang-scheduled) across the machine to run effectively, due to their demanding synchronization requirements. F processes have enough synchronization requirements to be co-scheduled, but due to load imbalance (which can arise for any of the reasons detailed in the original paper's introduction) they often cannot make full use of their allotted CPU time. DC processes rarely synchronize and can be scheduled independently of each other without penalizing the system's utilization or the job's performance; for example, a job using a coarse-grained workpile model would be categorized as DC. RE processes belong to jobs that have little synchronization but require a similar (balanced) amount of CPU time for all their processes. Processes are categorized based on measured process statistics, collected by a lightweight monitoring layer integrated with MPI: synchronous communication primitives in MPI call one of four low-latency functions to note when the process starts/ends a synchronous operation and when it enters and exits blocking mode. Applications only need to be re-linked with the modified MPI library, without any source changes. The accuracy of this monitoring layer was verified using synthetic applications for which the measured parameters were known in advance, and it was found to be precise to within 0.1%. In summary, batch and gang scheduling perform poorly under dynamic or load-imbalanced workloads, whereas implicit co-scheduling suffers performance penalties for fine-grained synchronous jobs. Most job schedulers offer little adaptation to externally and internally fragmented workloads, resulting in reduced machine utilization and longer response times; flexible co-scheduling was designed specifically to alleviate these problems by dynamically adjusting scheduling to varying workload and application requirements.

2.4 Other

Liu et al. [39] present a general-purpose resource selection framework that addresses the problems of first discovering and then organizing resources to meet application requirements, by defining a resource selection service for locating Grid resources that match application requirements. At the heart of this framework is a simple but powerful declarative language based on a technique called set matching, which extends the Condor matchmaking framework to support both single-resource and multiple-resource selection; a sketch of the matching idea is given at the end of this section. The framework also provides an open interface for loading application-specific mapping modules to personalize the resource selector. Within this framework, both application resource requirements and application performance models are specified declaratively, in the ClassAd language, while mapping strategies can be determined by user-supplied code. Liu et al. [46] extended this work by designing and implementing a description language, RedLine, for expressing constraints associated with resource consumers (requests) and resource providers. They also implemented a matchmaking process that uses constraint-solving techniques to solve the combinatorial satisfaction problems that arise when resolving constraints. The resulting system has significantly enhanced expressiveness compared with previous approaches, being able to deal with requests that involve multiple resources and that express constraints on policies as well as properties. Some of the various methods used or proposed in the literature for extracting performance models are:

  • Genetic Algorithm [32, 41]
  • Simulated Annealing [41]
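To illustrate the set-matching idea referenced above, here is a small sketch in which a request carries both per-resource and aggregate (whole-set) constraints; this is illustrative Python, not the actual ClassAd or RedLine syntax, and all names and numbers are hypothetical:

```python
from itertools import combinations

resources = [
    {"name": "n1", "mem_gb": 4, "mflops": 500},
    {"name": "n2", "mem_gb": 8, "mflops": 300},
    {"name": "n3", "mem_gb": 8, "mflops": 800},
]

per_resource = lambda r: r["mem_gb"] >= 4                     # each node needs >= 4 GB
aggregate = lambda rs: sum(r["mflops"] for r in rs) >= 1000   # total compute for the set

def set_match(resources, per_resource, aggregate, max_size=3):
    """Return the smallest resource set satisfying both constraint levels;
    single-resource matching is the max_size=1 special case."""
    for k in range(1, max_size + 1):
        for subset in combinations(resources, k):
            if all(per_resource(r) for r in subset) and aggregate(subset):
                return subset
    return None

match = set_match(resources, per_resource, aggregate)
print([r["name"] for r in match] if match else "no match")
```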
